Lab 1: Intro to R and data analysis
Lecture 1: topics illustrated in class
- Introduction to R and RStudio
- Why R?
- Principles of reproducible analysis with R + RStudio
- R objects, functions, packages
- Understanding different types of variables
- Principles of “tidy data”
- Descriptive statistics
- Measures of central tendency, measures of variability (or spread), and frequency distribution
- Visual data exploration
- {ggplot2}
- Foundations of inference
___
Lab 1 datasets
Below are the datasets used in the Practice session:
- Download a whole subfolder
The first section is a quick review of the installation process.
___
Introduction to R and RStudio
Install
R is available for free for Windows, GNU/Linux, and macOS.
- To install R, you can go to this link. The latest available release is R 4.3.3 “Angel Food Cake”, released on 2024-02-29, but any (fairly recent) version will do.
If you have previously installed R on your machine, you can check which version you are running by executing this command in R:
# From the R console
base::R.version.string
# (This is the version on my own machine)
# [1] "R version 4.2.2 (2022-10-31)"
…or by executing this command in your CLI (Command Line Interface):
# From Terminal/Powershell/bash
R --version
Install RStudio IDE
While not strictly required, it is highly recommended that you also install RStudio to facilitate your work. RStudio Desktop is an Integrated Development Environment (IDE), basically a graphical interface wrapping and interfacing R (which needs to be installed first).
R, which is a command line driven program, can be executed via its native interface (R GUI), as well as from many other code editors, like VS Code, Sublime Text, Jupyter Notebook, etc. RStudio remains the most widely used by beginners and advanced programmers alike, because of its intuitive and integrated interface.
- To install RStudio you can go to this link. The free version contains everything you need.
Managing files and projects
In any analytical endeavor it is very likely that you will handle a collection of files (likely organized in folders, such as input_data, output_data, R_scripts, paper, etc.). RStudio provides a fantastic tool for organizing all the files pertaining to a project, called an “R Project”.
Creating an R Project
An R Project will keep all the files associated with a project (including invisible ones!) organized together – input data, R scripts, analytical results, figures. Besides being common practice, this has the advantage of implicitly setting the “working directory”, which is incredibly important when you need to load or output files, specifying their file path.
In Figure 1 you can see how easy it is just following RStudio prompts:
- Create a new directory for each project
- Select parent folder
- Notice that the Files tab now shows a file with the extension .Rproj, which tells RStudio that all the files in the folder belong together.
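To check that the project has set the working directory as expected, a quick sketch like the following can be run from the R console. It only uses base R functions; the object names project_dir and rproj_files are illustrative.

```r
# Folder that R currently treats as the working directory
project_dir <- getwd()

# Look for the .Rproj file that marks the project root
# (an empty result means the session was not started from a project folder)
rproj_files <- list.files(project_dir, pattern = "\\.Rproj$")

project_dir
rproj_files
```

If the session was opened through the .Rproj file, project_dir should be the project folder and rproj_files should list that file.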
Install R packages
An R package is a shareable bundle of functions. Besides the basic built-in functions already contained in the program (i.e. the base package), many useful R functions come in free libraries of code (or packages) written by R’s users. You can find them in different repositories:
- CRAN (Comprehensive R Archive Network) - the general package repository for R: https://cran.r-project.org/.
- Bioconductor - a package repository geared towards bioinformatics and biostatistics: https://www.bioconductor.org/.
- GitHub https://github.com/ - a website and cloud-based service that helps developers store and manage their code. Here you will find R packages in the development stage, or the newest version of an existing one (it may be less stable!).
- and more…
Let’s take for example the R package here, a package that helps handle file paths in a reproducible manner. To install it for the first time, open an R session and execute:
From CRAN (stable version)
# Installing (ONLY the 1st time)
utils::install.packages('here')
# OR (same)
install.packages('here')
Here you are actually using a function (install.packages) of a pre-installed package (utils) with the syntax packagename::function_name. This prevents any ambiguity in case of duplicate function names, and it also helps you see which package you are using.
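As a small illustration of the packagename::function_name syntax, the sketch below calls the same function with and without the package prefix. It uses sd from the stats package, which ships with R; the vector values is made up for the example.

```r
# Made-up numbers, for illustration only
values <- c(2, 4, 4, 5)

# Explicit form: states which package the function comes from
sd_explicit <- stats::sd(values)

# Implicit form: R searches the loaded packages for a function named sd
sd_implicit <- sd(values)

# Both calls run the same function, so the results match
identical(sd_explicit, sd_implicit)
```

The explicit form also works for an installed package that has not been loaded in the session.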
Once you have installed a package, at every subsequent R session, you will only need to load it, like so:
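A minimal sketch of that loading step, assuming the here package has already been installed. The check with requireNamespace is an addition for safety and is not required in everyday use.

```r
# Load the package for the current session (needed once per session).
# Assumption: 'here' was installed beforehand.
if (requireNamespace("here", quietly = TRUE)) {
  library(here)
} else {
  message("Package 'here' is not installed yet; run install.packages('here') first.")
}

# A package that ships with R, such as 'stats', is loaded the same way
library(stats)
stats_loaded <- "package:stats" %in% search()
stats_loaded
```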
Using the graphical interface
You can also install and update packages using the “Packages” tab on the lower right pane of RStudio.
From GitHub (testing version)
You can use the package devtools and its function install_github to install the developer’s version of a package from its remote GitHub repository. Let’s try it with a nice little package, paint (which colors the structure of a dataset when printing).
# Installing devtools (ONLY the 1st time)
utils::install.packages('devtools')
# Installing paint from GitHub
library(devtools)
devtools::install_github("MilesMcBain/paint")
# test paint out
library(paint)
# it will show me the structure of a data.frame like this...
paint(mtcars)
# ... instead of plain old
str(mtcars)
After running devtools::install_github("MilesMcBain/paint"), R may ask whether you want to update related packages; respond in the console by choosing your preferred answer.
Help on R package/function
To inquire about a package and/or its functions, you can write ?name in your console to open the matching Help page, or ??name to search the help pages of all installed packages. RStudio will show the result in the lower right pane.
# Opening Help page on package/function
?here
??here
Defining (reproducible) file paths: here
It is never good practice to “hard code” the file’s absolute path: most likely this will break your code as soon as you (or someone else) need to run it on a different computer, let alone within a different OS.
So if your code to read & load a file is written like this:
# [NOT REPRODUCIBLE] hard coding your file path -----------------------
# File path on Mac:
dataset <- readr::read_csv(
  "/Users/testuser/R4biostats/input_data/dataset.csv")
# Same file path on Windows (backslashes must be escaped inside R strings):
dataset <- readr::read_csv(
  "C:\\Users\\testuser\\R4biostats\\input_data\\dataset.csv")
…it won’t work on someone else’s computer since they don’t have that same file structure!
This is where the fantastic here package intervenes and lets you reference file paths in a reproducible manner (anchored on the R Project’s folder as the root):
1. It lets you use relative paths, i.e. specify the file path relative to the project folder containing project_name.Rproj.
2. No more “/” vs. “\” issue (where Windows and Linux/Mac OSs differ).
3. Subfolder levels are added as separate arguments, separated by “,”.
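The three points above can be sketched as follows, reusing the input_data/dataset.csv example from the hard-coded version. The fallback branch built with file.path is an illustrative addition for machines where here is not installed; it is not part of the package.

```r
# Assumption: the project contains input_data/dataset.csv, as in the example above
if (requireNamespace("here", quietly = TRUE)) {
  # Each folder level is a separate argument; here() prepends the project root
  dataset_path <- here::here("input_data", "dataset.csv")
} else {
  # Fallback for illustration only, if 'here' is not installed:
  # build the path from the current working directory
  dataset_path <- file.path(getwd(), "input_data", "dataset.csv")
}
dataset_path

# Read the file only if it exists at that location (readr must be installed)
if (file.exists(dataset_path)) {
  dataset <- readr::read_csv(dataset_path)
}
```

The same call produces a valid path on Windows, Linux and macOS, because the separator is chosen by R rather than typed by hand.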
Make sure we have R packages needed for the Lab
To install an R package, open an R session and execute:
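One possible way to check which packages are still missing before installing them is sketched below. The vector lab_packages is an assumption based on the packages named on this page (here, readr, ggplot2); the actual list for the Lab may differ. The install call is left commented out so that nothing is downloaded unintentionally.

```r
# Assumed list of packages, taken from those mentioned on this page
lab_packages <- c("here", "readr", "ggplot2")

# Names of all packages currently installed on this machine
installed <- rownames(utils::installed.packages())

# Packages from the list that are not installed yet
missing_packages <- setdiff(lab_packages, installed)

if (length(missing_packages) > 0) {
  message("Not yet installed: ", paste(missing_packages, collapse = ", "))
  # Uncomment the next line to install them from CRAN
  # utils::install.packages(missing_packages)
} else {
  message("All listed packages are already installed.")
}
```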
___
R objects, functions, packages
This was discussed in Lecture 1.
Now we will…
Read a dataset into R workspace
Let’s start by loading the file we will work on.
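As a generic sketch of reading a CSV file into the workspace, the example below first writes a few rows of the built-in mtcars dataset to a temporary file and then reads it back. The temporary file is a hypothetical stand-in for the Lab dataset, and utils::read.csv is used only because it ships with R.

```r
# Hypothetical stand-in for the Lab file: a small CSV written to a temporary location
example_file <- tempfile(fileext = ".csv")
utils::write.csv(head(mtcars), example_file, row.names = FALSE)

# Read the CSV into a data.frame in the workspace
dataset <- utils::read.csv(example_file)

# Inspect the structure: column names, types and first values
str(dataset)
```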
Understanding different types of variables
Principles of “tidy data”
___
Descriptive statistics
Measures of central tendency, measures of variability (or spread), and frequency distribution
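As a small illustration of these three groups of measures, the sketch below uses the built-in mtcars dataset (already used earlier on this page). The choice of the columns mpg and cyl is only an example.

```r
mpg <- mtcars$mpg

# Central tendency
mean_mpg <- mean(mpg)
median_mpg <- median(mpg)

# Variability (spread)
sd_mpg <- sd(mpg)
range_mpg <- range(mpg)   # minimum and maximum
iqr_mpg <- IQR(mpg)       # interquartile range

# Frequency distribution of a discrete variable
cyl_counts <- table(mtcars$cyl)        # absolute frequencies
cyl_props <- prop.table(cyl_counts)    # relative frequencies

mean_mpg
median_mpg
sd_mpg
cyl_counts
```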
___
Visual data exploration
ggplot2
___
Foundations of inference
___
Lab 1 complete R code
Here you will find the solved problems addressed in Lab 1
- as .R file